Abstract

An important application of distance geometry to biochemistry studies the embeddings of the
vertices of a weighted graph in the three-dimensional Euclidean space such that the edge weights are
equal to the Euclidean distances between corresponding point pairs. When the graph represents the
backbone of a protein, one can exploit the natural vertex order to show that the search space for
feasible embeddings is discrete. The corresponding decision problem can be solved using a binary
tree based search procedure which is exponential in the worst case. We discuss assumptions that
bound the search tree width to a polynomial size.

Keywords: Branch-and-Prune, symmetry, distance geometry.

1 Introduction

We study the following decision problem [5]:

Discretizable Molecular Distance Geometry Problem (DMDGP). Given a simple
undirected weighted graph $G = (V, E, d)$ where $d : E \to \mathbb{R}_+$, $V$ is ordered so that
$V = [n] = \{1, \ldots, n\}$, and the following assumptions hold:

1. for all $v > 3$ and $u \in V$ with $1 \le v - u \le 3$, $\{u, v\} \in E$ (Discretization)

2. for all $v > 3$, $E$ contains all edges $\{u, w\}$ with $u \ne w \in U_v = \{u \in V \mid 1 \le v - u \le 3\}$,
and the distances $d_{uw}$ with $u \ne w \in U_v$ obey the strict simplex inequalities [T] (Strict
Simplex Inequalities),

and given an embedding $x' : [3] \to \mathbb{R}^3$, is there an embedding $x : V \to \mathbb{R}^3$ extending $x'$, such
that

$\forall \{u, v\} \in E \quad \|x_u - x_v\| = d_{uv}$ ? (1)

Note that the strict simplex inequalities in $\mathbb{R}^3$ reduce to the strict triangular inequalities
$d_{v-3,v-1} < d_{v-3,v-2} + d_{v-2,v-1}$. An embedding $x$ extends an embedding $x'$ if $x'$ is a restriction of $x$;
an embedding is feasible if it satisfies (1). We also consider the following problem variants:

• DMDGP$_K$, i.e. the family of decision problems (parametrized by the positive integer $K$) obtained
by replacing each symbol '3' in the DMDGP definition by the symbol '$K$';

• the ${}^K$DMDGP, where $K$ is given as part of the input (rather than being a fixed constant as in the
DMDGP$_K$).

We remark that DMDGP = DMDGP$_3$. Other related problems also exist in the literature, such as the
Discretizable Distance Geometry Problem (DDGP) [13], where the Discretization axiom is 
relaxed to require that each vertex v > K has at least K adjacent predecessors. The original results in 
this paper, however, only refer to the DMDGP and its variants. 

The Discretization axiom guarantees that the locus of the points embedding $v$ in $\mathbb{R}^3$ is the
intersection of the three spheres centered at $v - 3, v - 2, v - 1$ with radii $d_{v-3,v}, d_{v-2,v}, d_{v-1,v}$.
If this intersection is non-empty, then it contains two points apart from a set of Lebesgue measure 0
where it may contain either one point or infinitely many. The role of the Strict Simplex Inequalities
axiom is to prevent the latter case of infinitely many points. As such we might actually dispense with
this axiom altogether and simply discuss results that occur with probability 1. We remark that if the
intersection of the three spheres is empty, then the instance is a NO one. The Discretization axiom
allows the solution of DMDGP instances using a recursive algorithm called Branch-and-Prune (BP) [5]:
at level $v$, the search is branched according to the (at most two) possible positions for $v$. The BP
generates a (partial) binary search tree of height $n$, each full branch of which represents a feasible
embedding for the given graph.
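
To make the geometric step concrete, the following Python sketch computes the intersection of three
spheres in $\mathbb{R}^3$ by working in a local orthonormal frame attached to the three centers. It is an
illustration under our own assumptions, not the implementation used in [5]: the function name
`three_sphere_intersection`, the helper names and the tolerance `eps` are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def three_sphere_intersection(c1, c2, c3, r1, r2, r3, eps=1e-9):
    """Return the points of R^3 at distance r_i from center c_i (i = 1, 2, 3).

    The result holds 0, 1 or 2 points. Coincident or collinear centers (the
    degenerate case excluded by the strict triangular inequalities) raise
    ValueError. The tolerance eps is an illustrative choice.
    """
    # Local frame: ex along c1->c2, ey in the plane of the centers, ez normal to it.
    e = _sub(c2, c1)
    d = math.sqrt(_dot(e, e))
    if d < eps:
        raise ValueError("coincident centers")
    ex = tuple(k / d for k in e)
    t = _sub(c3, c1)
    i = _dot(ex, t)
    f = tuple(tk - i * ek for tk, ek in zip(t, ex))
    j = math.sqrt(_dot(f, f))
    if j < eps:
        raise ValueError("collinear centers")
    ey = tuple(k / j for k in f)
    ez = _cross(ex, ey)
    # Coordinates of the candidate points in the local frame.
    x = (r1 * r1 - r2 * r2 + d * d) / (2 * d)
    y = (r1 * r1 - r3 * r3 + i * i + j * j - 2 * i * x) / (2 * j)
    z2 = r1 * r1 - x * x - y * y
    if z2 < -eps:
        return []                       # empty intersection
    base = tuple(c + x * a + y * b for c, a, b in zip(c1, ex, ey))
    if z2 <= eps:
        return [base]                   # the two points coincide
    z = math.sqrt(z2)
    return [tuple(b + z * n for b, n in zip(base, ez)),
            tuple(b - z * n for b, n in zip(base, ez))]
```

The two returned points are mirror images of each other with respect to the plane through the three
centers, which is the symmetry exploited in Sect. 4.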

The DMDGP and its variants are related to the Molecular Distance Geometry Problem
(MDGP), which asks to find an embedding in $\mathbb{R}^3$ of a given weighted undirected graph. We denote
the generalization of the MDGP to embeddings in $\mathbb{R}^K$ where $K$ is part of the input by Distance
Geometry Problem (DGP), and the variants with fixed $K$ by DGP$_K$. The MDGP is a good model for
determining the structure of molecules given a set of inter-atomic distances [TUl IE]. Such distances can
usually be found using Nuclear Magnetic Resonance (NMR) experiments [17], a technique which allows
the detection of inter-atomic distances below 5 Å. The DGP has applications in wireless sensor networks
[1] and graph drawing. In general, the MDGP and DGP implicitly require a search in a continuous
Euclidean space [10].

The DMDGP is a model for protein backbones. For any atom $v \in V$, the distances $d_{v-1,v}$ and $d_{v-2,v-1}$
are known because they refer to covalent bonds. Furthermore, the angle between $v - 2$, $v - 1$ and $v$ is
known because it is adjacent to two covalent bonds, which implies that $d_{v-2,v}$ is also known by triangular
geometry. In general, the distance $d_{v-3,v}$ is smaller than 5 Å and can therefore be assumed to be known
by NMR experiments; in practice, there are ways to find atomic orders which ensure that $d_{v-3,v}$ is known
[TJ. There is currently no known protein with $d_{v-3,v-1}$ being exactly equal to $d_{v-3,v-2} + d_{v-2,v-1}$ [!].

The rest of this paper is organized as follows. In Sect. 2 we describe the BP algorithm. In Sect. 3
we discuss complexity issues. Sect. 4 describes some polynomial DMDGP subclasses. We make several
important contributions: an NP-hardness proof for the ${}^K$DMDGP and the DMDGP$_K$ (for $K > 2$), a
new proof that the number of feasible embeddings of DMDGP instances is a power of two, and some
practically relevant polynomial cases of the DMDGP.



2 The BP algorithm 

For all $v \in V$ we let $N(v) = \{u \in V \mid \{u, v\} \in E\}$ be the set of vertices adjacent to $v$. An embedding
of a subgraph of $G$ is called a partial embedding of $G$. We denote by $X$ the set of embeddings (modulo
congruences) solving a DMDGP$_K$ (or ${}^K$DMDGP) instance.

The BP algorithm exploits the edges guaranteed by the Discretization axiom in order to search
a discrete set: vertex $v$ can be placed in at most two possible positions (the intersection of $K$ spheres
in $\mathbb{R}^K$). Each is tested in turn and the procedure is called recursively for each feasible position. The
BP exploits all other edges in the graph in order to prune some branches: a position might be feasible
with respect to the distances to the $K$ immediate predecessors $v - 1, \ldots, v - K$, but not necessarily with
distances to other adjacent predecessors.

For a partial embedding $\bar{x}$ of $G$ and $\{u, v\} \in E$ let $S^{\bar{x}}_{uv}$ be the sphere centered at $x_u$ with radius $d_{uv}$.
The BP algorithm, used for solving the DMDGP and its variants, is BP($K + 1$, $x'$, $\emptyset$) (see Alg. 1), where
$x'$ is the initial embedding of the first $K$ vertices mentioned in the DMDGP definition.

Algorithm 1 BP($v$, $\bar{x}$, $X$)

Require: A vertex $v \in V \setminus [K]$, a partial embedding $\bar{x} = (x_1, \ldots, x_{v-1})$, a set $X$.

1: $P = \bigcap_{u \in N(v),\, u < v} S^{\bar{x}}_{uv}$;

2: $\forall p \in P$ ( $(x \leftarrow (\bar{x}, p))$; if ($v = n$) $X \leftarrow X \cup \{x\}$ else BP($v + 1$, $x$, $X$) ).

By the DMDGP axioms, $|P| \le 2$. At termination, $X$ contains all embeddings (modulo congruences)
extending $x'$ [HIE]. Embeddings $x \in X$ can be represented by sequences $\chi(x) \in \{-1, 1\}^n$ with:
(i) $\chi(x)_i = 1$ for all $i \le K$; (ii) for all $i > K$, $\chi(x)_i = -1$ if $a x_i < a_0$ and $\chi(x)_i = 1$ if
$a x_i \ge a_0$, where $a x = a_0$ is the equation of the hyperplane through $x_{i-K}, \ldots, x_{i-1}$. For an
embedding $x \in X$, $\chi(x)$ is the chirality of $x$ [5] (the formal definition of chirality actually states
$\chi(x)_i = 0$ if $a x_i = a_0$, but since this event has probability 0, we do not consider it here).
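
As an illustration of the chirality definition, the sketch below computes $\chi(x)$ for $K = 3$. The
orientation of the normal vector $a$ is not fixed by the definition above; here we assume it is given
by the cross product of the edges leaving the first of the three predecessors. The function name
`chirality` is hypothetical.

```python
def chirality(x):
    """Chirality vector of an embedding x (a list of points of R^3), for K = 3.

    Assumption: the normal a of the plane through the three predecessors
    p, q, r of x[i] is oriented as (q - p) x (r - p).
    """
    chi = [1] * min(3, len(x))          # (i) the first K entries are 1
    for i in range(3, len(x)):
        p, q, r = x[i - 3], x[i - 2], x[i - 1]
        u = [q[k] - p[k] for k in range(3)]
        w = [r[k] - p[k] for k in range(3)]
        a = (u[1] * w[2] - u[2] * w[1],
             u[2] * w[0] - u[0] * w[2],
             u[0] * w[1] - u[1] * w[0])
        # side equals a.x_i - a_0, since the plane passes through p.
        side = sum(a[k] * (x[i][k] - p[k]) for k in range(3))
        chi.append(1 if side >= 0 else -1)   # (ii) side of the hyperplane
    return chi
```

Mirroring the whole tail of an embedding through the plane of the first three points flips every
chirality entry after the third, which is the behaviour formalized by the partial reflections of Sect. 4.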

The BP (Alg. 1) can be run to termination to find all possible embeddings of $G$, or stopped after the
first leaf node at level n is reached, in order to find just one embedding of G. In the last few years we 
have conceived and described several BP variants targeting different problems 0, including, very recently, 
problems with interval-type uncertainties on some of the distance values [7]. Compared to continuous
search algorithms (e.g. [12]), the performance of the BP algorithm is impressive from the point of view
of both efficiency and reliability. The BP algorithm, moreover, is currently the only method able to find 
all embeddings for a given protein backbone. 



3 Complexity 

Any class of YES instances where each vertex v only has distances to the K immediate predecessors 
provides a full BP binary search tree (after level K) , and therefore shows that the BP is an exponential- 
time algorithm in the worst case. One remarkable feature of the computational experiments conducted 
on our BP implementation [15] on protein instances is that the exponential-time behaviour of the BP 
algorithm was never noticed empirically. When we were able to embed protein backbones of ten thousand 
atoms in just over 13 seconds of CPU time (on a single core) [3], we started to suspect that protein 
instances might have some special properties ensuring that the BP ran in polynomial time. Specifically, 
using the particular structure of the protein graph, we argue in Sect. 4 that it is reasonable to expect
that the BP will yield a search tree of bounded width. 

Restricting $d$ to only take integer values, the DGP$_1$ is NP-complete by reduction from Subset-Sum,
the DGP$_K$ is (strongly) NP-hard by reduction from 3-SAT, and the DGP is (strongly) NP-hard by
induction on $K$ [TB]. Only the DGP$_1$ is NP-complete because if $d$ is integer then the YES-certificate
$x$ (the embedding) can be chosen to have integer values. It is currently not known whether there is a
polynomial length encoding of the algebraic numbers that can be used to show that DGP is in NP.

The DMDGP is NP-hard by reduction from Subset-Sum (Thm. 3 in [5]). We generalize that proof
to the DMDGP$_K$. Intuitively, we exploit the fact that a Subset-Sum instance $a_1, \ldots, a_N$ with solution
$s_1, \ldots, s_N \in \{-1, 1\}$ has $\sum_{\ell \le N} s_\ell a_\ell = 0$ (the zero-sum property) to construct a DMDGP instance with
$KN + 1$ points, where the zero-th point is at the origin and the $\ell$-th set of $K$ successive points is associated
to $a_\ell$; the $j$-th point in the $\ell$-th set adds $s_\ell a_\ell$ to its $j$-th coordinate, so that the last point is again the
origin (all coordinates satisfy the Subset-Sum's zero-sum property).

3.1 Theorem

The DMDGP$_K$ is NP-hard for all $K > 2$.

Proof. Let $a = (a_1, \ldots, a_N)$ be an instance of Subset-Sum consisting of positive integers, and define
an instance of DMDGP$_K$ where $V = \{0, \ldots, KN\}$, $E$ includes $\{i, i + j\}$ for all $j \in \{1, \ldots, K\}$ and
$i \in \{0, \ldots, KN - j\}$, and:

$\forall i \in \{0, \ldots, KN - 1\} \quad d_{i,i+1} = a_{\lceil (i+1)/K \rceil}$ (2)

$\forall j \in \{2, \ldots, K\},\ i \in \{0, \ldots, KN - j\} \quad d_{i,i+j} = \sqrt{\sum_{\ell=1}^{j} d_{i+\ell-1,i+\ell}^2}$ (3)

$d_{0,KN} = 0$. (4)

Let $s \in \{-1, 1\}^N$ be a solution of the Subset-Sum instance $a$. We let $x_0 = 0$ and for all $i = K(\ell - 1) + j > 0$
we let $x_i = x_{i-1} + s_\ell a_\ell e_j$, where $e_j$ is the vector with a one in component $j$ and zero elsewhere. Because
$\sum_{\ell \le N} s_\ell a_\ell = 0$, if $s$ solves the Subset-Sum instance $a$ then, by inspection, $x$ solves the corresponding
DMDGP instance (2)-(4). Conversely, let $x$ be an embedding that solves (2)-(4), where we assume
without loss of generality that $x_0 = 0$. Then (3) ensures that the line through $x_i, x_{i-1}$ is orthogonal
to the line through $x_{i-1}, x_{i-2}$ for all $i > 1$, and again we assume without loss of generality that, for
all $j \in \{1, \ldots, K\}$, the lines through $x_{j-1}, x_j$ are parallel to the $j$-th coordinate axis. Now consider the
chirality $\chi$ of $x$: because all distance segments are orthogonal, for each $j \le K$ the $j$-th coordinate is given
by $x_{KN,j} = \sum_{i \bmod K = j} \chi_i a_{\lceil i/K \rceil}$. Since $d_{0,KN} = 0$, for all $j \le K$ we have
$0 = x_{KN,j} = \sum_{\ell \le N} \chi_{K(\ell-1)+j} a_\ell$, which implies that, for all $j \le K$,
$s^j = (\chi_{K(\ell-1)+j} \mid 1 \le \ell \le N)$ is a solution for the Subset-Sum instance $a$. □
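
The forward direction of the proof can be checked numerically. The Python sketch below builds the
edge weights (2)-(4) from a Subset-Sum instance and the embedding $x_i = x_{i-1} + s_\ell a_\ell e_j$ from a sign
vector $s$, then tests feasibility. The function names and the dictionary encoding of $d$ are
hypothetical; the list `a` is 0-based, so `a[i // K]` plays the role of $a_{\lceil (i+1)/K \rceil}$, and $N \ge 2$ is
assumed so that the edge $\{0, KN\}$ is distinct from the edges of (3).

```python
import math

def subset_sum_to_dmdgp(a, K):
    """Edge weights d[(i, i + j)] on vertices 0..K*N, following (2)-(4)."""
    N = len(a)
    d = {}
    for i in range(K * N):                       # (2): consecutive vertices
        d[(i, i + 1)] = float(a[i // K])
    for j in range(2, K + 1):                    # (3): Pythagorean sums
        for i in range(K * N - j + 1):
            d[(i, i + j)] = math.sqrt(
                sum(d[(i + l - 1, i + l)] ** 2 for l in range(1, j + 1)))
    d[(0, K * N)] = 0.0                          # (4): last point on the first
    return d

def embedding_from_signs(a, s, K):
    """Points x_0, ..., x_{KN} in R^K obtained from the sign vector s."""
    x = [[0.0] * K]
    for l in range(len(a)):
        for j in range(K):
            p = list(x[-1])
            p[j] += s[l] * a[l]                  # step s_l * a_l along axis j
            x.append(p)
    return x

def is_feasible(x, d, tol=1e-9):
    """True if every edge weight matches the Euclidean distance."""
    return all(abs(math.dist(x[u], x[v]) - w) <= tol for (u, v), w in d.items())
```

For instance, with `a = [1, 2, 3]` and `K = 3`, the sign vector `[1, 1, -1]` has zero sum and yields a
feasible embedding, whereas `[1, 1, 1]` violates (4).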

3.2 Corollary 

The ${}^K$DMDGP is NP-hard.

Proof. Every specific instance of the ${}^K$DMDGP specifies a fixed value for $K$ and hence belongs to the
DMDGP$_K$. Hence the result follows by inclusion. □



4 BP search trees with bounded width 

We partition $E$ into the sets $E_D = \{\{u, v\} \mid |v - u| \le K\}$ and $E_P = E \setminus E_D$. We call $E_D$ the discretization
edges and $E_P$ the pruning edges. Discretization edges guarantee that a DGP instance is in the ${}^K$DMDGP.
Pruning edges are used to reduce the BP search space by pruning its tree. In practice, pruning edges
might make the set $P$ in Alg. 1 have cardinality 0 or 1 instead of 2. We assume $G$ is a YES instance of
the ${}^K$DMDGP.



4.1 The discretization group 

Let $G_D = (V, E_D, d)$ and $X_D$ be the set of embeddings of $G_D$; since $G_D$ has no pruning edges, the
BP search tree for $G_D$ is a full binary tree and $|X_D| = 2^{n-K}$. The discretization edges arrange the
embeddings so that, at level $\ell$, there are $2^{\ell-K}$ possible embeddings $x_v$ for the vertex $v$ with rank $\ell$. We
assume that $|P| = 2$ at each level $v$ of the BP tree, an event which, in absence of pruning edges, happens
with probability 1; thus many results in this section are stated with probability 1. Let $x_v, x'_v$ be the
possible embeddings of $v$ at level $v$ of the tree; then by elementary spherical geometry considerations, $x'_v$
is the reflection of $x_v$ through the hyperplane defined by $x_{v-K}, \ldots, x_{v-1}$. Denote this reflection by $R^v_x$.

4.1 Theorem (Cor. 4.5 and Thm. 4.8 in [11])

With probability 1, for all $v > K$ and $u < v - K$ there is a set $H^{uv}$, with $|H^{uv}| = 2^{v-u-K}$, of real
positive values such that for each $x \in X$ we have $\|x_v - x_u\| \in H^{uv}$. Furthermore, $\forall x \in X$
$\|x_v - x_u\| = \|R^{u+K}_x(x_v) - x_u\|$ and $\forall x' \in X$, if $x'_v \notin \{x_v, R^{u+K}_x(x_v)\}$ then $\|x_v - x_u\| \ne \|x'_v - x_u\|$.

Proof. Sketched in Fig. 1 for $K = 2$; the circles mark equidistant levels from 1. Intuitively, two branches
from level 1 to level 4 or 5 will have equal segments but different angles, which will cause the end dots
to be at different distances from level 1. The formal proof is by induction on the level distance. □




Figure 1: A pruning edge $\{1, 4\}$ prunes either $\nu_6, \nu_7$ or $\nu_5, \nu_8$.

We now define partial reflection operators: 

$g_v(x) = (x_1, \ldots, x_{v-1}, R^v_x(x_v), \ldots, R^v_x(x_n))$. (5)

The $g_v$'s map an embedding $x$ to its partial reflection with first branch at $v$. It is evident that the $g_v$'s
are injective with probability 1 and idempotent (in the sense that $g_v g_v(x) = x$).
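
A partial reflection is easy to realize numerically for $K = 3$: every point from rank $v$ onwards is
mirrored through the plane spanned by the three points preceding $v$. The sketch below is an
illustration only; the names `reflect` and `partial_reflection` are hypothetical, and vertex $i$ is
stored at list index $i - 1$.

```python
def reflect(y, p, q, r):
    """Mirror image of the point y through the plane containing p, q, r."""
    a = [q[k] - p[k] for k in range(3)]
    b = [r[k] - p[k] for k in range(3)]
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])              # normal of the plane
    nn = sum(c * c for c in n)
    t = 2.0 * sum((y[k] - p[k]) * n[k] for k in range(3)) / nn
    return [y[k] - t * n[k] for k in range(3)]

def partial_reflection(x, v):
    """g_v for K = 3: keep x_1..x_{v-1}, reflect x_v..x_n (requires v > 3)."""
    p, q, r = x[v - 4], x[v - 3], x[v - 2]       # vertices v-3, v-2, v-1
    return [list(pt) for pt in x[:v - 1]] + [reflect(y, p, q, r) for y in x[v - 1:]]
```

Since a reflection is an isometry that fixes its mirror plane, applying `partial_reflection` twice with
the same `v` returns the original points, and the distances between vertices at rank distance at most
3 are unchanged.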

4.2 Lemma 

For $u, v \in V$ such that $u, v > K$, $g_u g_v(x) = g_v g_u(x)$.

Proof. Assume without loss of generality $u < v$. Then:

$g_u g_v(x) = g_u(x_1, \ldots, x_{v-1}, R^v_x(x_v), \ldots, R^v_x(x_n))$
$= (x_1, \ldots, x_{u-1}, R^u_{g_v(x)}(x_u), \ldots, R^u_{g_v(x)} R^v_x(x_v), \ldots, R^u_{g_v(x)} R^v_x(x_n))$
$= (x_1, \ldots, x_{u-1}, R^u_x(x_u), \ldots, R^v_{g_u(x)} R^u_x(x_v), \ldots, R^v_{g_u(x)} R^u_x(x_n))$
$= g_v(x_1, \ldots, x_{u-1}, R^u_x(x_u), \ldots, R^u_x(x_n))$
$= g_v g_u(x)$,

where $R^u_{g_v(x)} R^v_x(x_w) = R^v_{g_u(x)} R^u_x(x_w)$ for each $w \ge v$ by Lemma 4.2 in [11]. □



We define the discretization group to be the group $\mathcal{G}_D = \langle g_v \mid v > K \rangle$ generated by the $g_v$'s.

4.3 Corollary

With probability 1, $\mathcal{G}_D$ is an Abelian group isomorphic to $C_2^{n-K}$.

For all $v > K$ let $\gamma_v = (1, \ldots, 1, -1_v, \ldots, -1)$ be the vector consisting of ones in the first $v - 1$ components
and $-1$ in the last components. Then the $g_v$ actions are directly mapped onto the chirality functions.

4.4 Lemma 

For all $x \in X$, $\chi(g_v(x)) = \chi(x) \odot \gamma_v$, where $\odot$ is the componentwise vector multiplication.

Proof. This follows by definition of $g_v$ and of chirality of an embedding. □

Because, by Alg. 1, each $x \in X$ has a different chirality, for all $x, x' \in X$ there is $g \in \mathcal{G}_D$ such
that $x' = g(x)$, i.e. the action of $\mathcal{G}_D$ on $X$ is transitive. By Thm. 4.1, the distances associated to the
discretization edges are invariant with respect to the discretization group.

4.2 The pruning group 

Consider a pruning edge $\{u, v\} \in E_P$. By Thm. 4.1, with probability 1 we have $d_{uv} \in H^{uv}$, otherwise
the instance could not be a YES one. Also, again by Thm. 4.1, $d_{uv} = \|x_u - x_v\| \ne \|g_w(x)_u - g_w(x)_v\|$ for
all $w \in \{u + K + 1, \ldots, v\}$ (note that distance $\|\nu_1 - \nu_9\|$ in Fig. 1 is different from all its reflections
$\|\nu_1 - \nu_h\|$ with $h \in \{10, 11, 13\}$ w.r.t. $g_4, g_5$). We therefore define the pruning group
$\mathcal{G}_P = \langle g_w \mid w > K \wedge \forall \{u, v\} \in E_P\ (w \notin \{u + K + 1, \ldots, v\}) \rangle$. It is easy to show that
$\mathcal{G}_P \le \mathcal{G}_D$. By definition, the distances associated with the pruning edges are invariant with respect
to $\mathcal{G}_P$.

4.5 Theorem (Thm. 5.4 in [11]) 

The action of $\mathcal{G}_P$ on $X$ is transitive.

$|X|$ was shown to be a power of two with probability 1 in the unpublished technical report [11]. We
provide a shorter and clearer proof.

4.6 Theorem 

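With probability 1, $\exists \ell \in \mathbb{N}$ $|X| = 2^\ell$.

Proof. Since $\mathcal{G}_D \cong C_2^{n-K}$, $|\mathcal{G}_D| = 2^{n-K}$. Since $\mathcal{G}_P \le \mathcal{G}_D$, $|\mathcal{G}_P|$ divides the order of $\mathcal{G}_D$, which implies
that there is an integer $\ell$ with $|\mathcal{G}_P| = 2^\ell$. By Thm. 4.5, the action of $\mathcal{G}_P$ on $X$ only has one orbit,
i.e. $\mathcal{G}_P x = X$ for any $x \in X$. By idempotency, for $g, g' \in \mathcal{G}_P$, if $gx = g'x$ then $g = g'$. This implies
$|\mathcal{G}_P x| = |\mathcal{G}_P|$. Thus, for any $x \in X$, $|X| = |\mathcal{G}_P x| = |\mathcal{G}_P| = 2^\ell$. □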



4.3 The number of nodes as a function of pruning edges

Fig. 2 shows a Directed Acyclic Graph (DAG) $\mathcal{D}_{uv}$ that we use to compute the number of valid nodes as
a function of pruning edges between two vertices $u, v \in V$ such that $v > K$ and $u < v - K$. The first line
shows different values for the rank of $v$ w.r.t. $u$; an arc labelled with an integer $i$ implies the existence of
a pruning edge $\{u + i, v\}$ (arcs with $\vee$-expressions replace parallel arcs with different labels). An arc is
unlabelled if there is no pruning edge $\{w, v\}$ for any $w \in \{u, \ldots, v - K - 1\}$. The vertices of the DAG
are arranged vertically by BP search tree level, and are labelled with the number of BP nodes at a given
level, which is always a power of two by Thm. 4.6. A path in this DAG represents the set of pruning
edges between $u$ and $v$, and its incident vertices show the number of valid nodes at the corresponding
levels. For example, following unlabelled arcs corresponds to no pruning edge between $u$ and $v$ and leads
to a full binary BP search tree with $2^{v-K}$ nodes at level $v$.

[Figure 2: DAG with columns $v = u+K-1,\ u+K,\ u+K+1,\ u+K+2,\ u+K+3,\ u+K+4$]

Figure 2: Number of valid BP nodes (vertex label) at level $u + K + \ell$ (column) as a function of the pruning
edges (path spanning all columns).
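
The node counts read off the DAG can also be obtained by counting the generators of the pruning
group of the subgraph induced by the first $\ell$ vertices. The sketch below does this under two
assumptions of ours: the instance is a YES one (so the probability 1 results of Sect. 4.2 apply at
every level), and a pruning edge $\{u, v\}$ removes exactly the generators $g_w$ with $w \in \{u + K + 1, \ldots, v\}$,
which is the range consistent with Thm. 4.1 and with the caption of Fig. 1. The function name
`bp_level_widths` is hypothetical.

```python
def bp_level_widths(n, K, pruning_edges):
    """Predicted number of BP nodes at levels K+1, ..., n (YES instance).

    pruning_edges is an iterable of pairs (u, v) of 1-based vertex ranks with
    |v - u| > K. Entry i of the result refers to level K + 1 + i.
    """
    widths = []
    for level in range(K + 1, n + 1):
        removed = set()
        for u, v in pruning_edges:
            u, v = min(u, v), max(u, v)
            if v <= level:                       # edge already usable at this level
                removed.update(range(u + K + 1, v + 1))
        surviving = [w for w in range(K + 1, level + 1) if w not in removed]
        widths.append(2 ** len(surviving))       # order of the pruning group
    return widths
```

With $K = 2$, $n = 5$ and the single pruning edge $\{1, 4\}$, the function returns `[2, 2, 4]`: two of the
four nodes at level 4 are pruned, in agreement with Fig. 1. The last entry of the returned list is the
predicted value of $|X|$.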

4.4 Polynomial DMDGP cases 

For a given $G_D$, each possible pruning edge set $E_P$ corresponds to a path spanning all columns in $\mathcal{D}_{1n}$.
Instances with diagonal (Prop. 4.7) or below-diagonal (Prop. 4.8) $E_P$ paths yield BP trees with constant
width.

4.7 Proposition 

If $\exists v_0 > K$ s.t. $\forall v > v_0$ $\exists u < v - K$ with $\{u, v\} \in E_P$ then the BP search tree width is bounded by
$2^{v_0 - K}$.

Proof. This corresponds to a path $p = (1, 2, \ldots, 2^{v_0-K}, \ldots, 2^{v_0-K})$ that follows unlabelled arcs up to
level $v_0$ and then arcs labelled $v_0 - K - 1$, $v_0 - K - 1 \vee v_0 - K$, and so on, leading to nodes that are all
labelled with $2^{v_0-K}$ (Fig. 3, left). □

4.8 Proposition 

If $\exists v_0 > K$ such that every subsequence $s$ of consecutive vertices $> v_0$ with no incident pruning edge is
preceded by a vertex $v_s$ such that $\exists u_s < v_s$ $(v_s - u_s > |s| \wedge \{u_s, v_s\} \in E_P)$, then the BP search tree width
is bounded by $2^{v_0-K}$.

Proof. (Sketch) This situation corresponds to a below-diagonal path, Fig. 3 (right). □

In general, for those instances for which the BP search tree width has an $O(\log n)$ bound, the BP has
a polynomial worst-case running time $O(L 2^{\log n}) = O(Ln)$, where $L$ is the complexity of computing $P$.
Since $L$ is typically constant in $n$ [3], for such cases the BP runs in linear time $O(n)$.

Let $V' = \{v \in V \mid \exists \ell \in \mathbb{N}\ (v = 2^\ell)\}$.

4.9 Proposition 

If $\exists v_0 > K$ s.t. for all $v \in V \setminus V'$ with $v > v_0$ there is $u < v - K$ with $\{u, v\} \in E_P$ then the BP search
tree width at level $n$ is bounded by $2^{v_0} n$.

Figure 3: A path $p$ with treewidth 8 (left) and another path below $p$ (right).

Proof. This corresponds to a path along the diagonal $2^{v_0}$ apart from logarithmically many vertices in $V$
(those in $V'$), at which levels the BP doubles the number of search nodes (Fig. 4). □

Figure 4: A path with treewidth $O(n)$.

For a pruning edge set $E_P$ as in Prop. 4.9 or yielding a path below it, the BP runs in quadratic time
$O(n(n + 1)/2) = O(n^2)$.

4.5 Empirical verification 

On a set of sixteen protein instances from the Protein Data Bank (PDB), twelve satisfy Prop. 4.7 and four
Prop. 4.8, all with $v_0 = 4$. This is consistent with the computational insight [5] that BP has polynomial
complexity on real proteins.



5 Conclusion 



We exploit some geometrical properties of an NP-hard distance geometry problem with a specific vertex 
order to derive some polynomial cases. Empirically, protein backbones seem to fall in these cases; this
provides an explanation for the practical efficiency of a well-known embedding algorithm called Branch- 
and-Prune. 